
add timeout to dynamic Reconcile; switch from CloseSend to CloseRecv in streamLogs; add goroutine dump on timeouts/deadlocks #725

Merged
1 commit merged into tektoncd:main on Apr 8, 2024

Conversation

@gabemontero (Contributor) commented Mar 8, 2024

Changes

We ran with the change from #712 in our production system, and while we confirmed that the results logging memory leak was addressed, after about 12 to 13 hours our reconciler threads systematically became deadlocked and our watcher quit processing events.

We have not yet been able to get a goroutine dump with stack traces when this problem occurs, so we are unclear whether the #712 fixes directly caused the deadlock or whether another issue was encountered. Among other things, our API server container restarted during the watcher deadlock, and the previous pod logs gave no clear indication as to why.

This change pulls in a few potentially helpful pieces to either diagnose or work around the deadlock (a sketch follows below):
1) we have added a timeout to the context used in the dynamic Reconcile method, in case a blocked RPC call holding that context was somehow causing the problem
2) we also cancel the context on method exit, again to unblock things, and switch to CloseAndRecv instead of CloseSend to confirm that UpdateLog finished, so that cancelling the streamLogs context does not intermittently abort an UpdateLog call that would otherwise have succeeded
3) we analyze how the context was released, and if it was released by a timeout rather than a cancel, we initiate a goroutine dump with stack traces
4) using a context with timeout that is cancelled on exit from the Reconcile method requires no longer running sendLogs on a separate goroutine; otherwise we would reintroduce intermittent cancellation of UpdateLog processing before it could complete
5) we now log the deadlines for UpdateLog on the API server side

rh-pre-commit.version: 2.2.0
rh-pre-commit.check-secrets: ENABLED

/kind bug
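To make items 1, 2 (the cancel-on-exit part), and 4 above concrete, here is a minimal sketch, not the literal diff in this PR, of a reconcile-scoped timeout whose cancellation on exit unblocks hung RPCs while the log upload stays on the reconciler thread; the function and parameter names are illustrative:

import (
	"context"
	"time"
)

// Sketch only: wrap the work the dynamic Reconcile method does in a context
// with a timeout, and cancel that context on exit so any RPC still holding it
// is unblocked.
func reconcileWithTimeout(parent context.Context, timeout time.Duration, sendLogs func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	// Cancelling on exit unblocks anything still waiting on this context.
	defer cancel()

	// Run sendLogs on the reconciler thread (no `go func()`): returning from
	// here cancels ctx, so a background UpdateLog could otherwise be aborted
	// mid-stream before it completes.
	return sendLogs(ctx)
}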

Submitter Checklist

These are the criteria that every PR should meet; please check them off as you review them:

  • [n/a] Has Docs included if any changes are user facing
  • [n/a] Has Tests included if any functionality added or changed
  • [y] Tested your changes locally (if this is a code change)
  • [y] Follows the commit message standard
  • [y] Meets the Tekton contributor standards (including functionality, content, code)
  • [y] Has a kind label. You can add a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • [y] Release notes block below has been updated with any user-facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • [n/a] Release notes contain the string "action required" if the change requires additional action from users switching to the new release

Release Notes

Address reconciler threads potentially deadlocked while streaming logs

@tekton-robot tekton-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Mar 8, 2024
@tekton-robot tekton-robot requested review from avinal and dibyom March 8, 2024 19:43
@tekton-robot tekton-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 8, 2024
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 63.9% -0.3

@gabemontero (Contributor Author)

OK, 4cc4cab passed the e2e's.

Next, in a separate commit, I will attempt a more expansive use of a context with timeout around all gRPC calls from the dynamic reconciler.

@tekton-robot tekton-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 8, 2024
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 64.1% -0.1

@gabemontero (Contributor Author)

OK @sayan-biswas, I have two forms of a fix and/or workaround for the results watcher hang we saw in our prod env today.

The first commit just covers updating logs to the results API server.

The second commit handles all gRPC calls to the API server coming out of the dynamic reconciler.

After some code review, it centers on the notion that we were blocked on a gRPC call to the API server, where the API server may have been in a bad state, given that I noticed an API server restart around the time I recycled the watcher pod.

So basically,

  1. by setting a timeout on the context used for all gRPC calls out of the dynamic reconciler, we time out and exit after 5 minutes if any of those calls are hung
  2. the second commit, in order to have one context cover both the log gRPC and non-log gRPC calls, has to move the call to streamLogs back onto the reconciler thread, as cancelling/cleaning up the parent context could otherwise inadvertently abort the UpdateLog call
  3. we also employ the CloseAndRecv improvement on the UpdateLog call, again to coordinate with cleaning up the context (see the sketch below)
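For item 3, the shape of the change on the client-streaming UpdateLog call is roughly the following sketch (error handling trimmed, variable names matching the excerpt below):

// Sketch: CloseSend only half-closes the stream and returns immediately, so a
// cancel issued right after it can abort an UpdateLog that would otherwise
// have succeeded. CloseAndRecv half-closes and then blocks until the server
// replies, confirming the upload finished before the context is cleaned up.
logsClient, err := r.resultsClient.UpdateLog(ctx)
if err != nil {
	return err
}
// ... logsClient.Send(...) for each chunk of log data ...
if _, err := logsClient.CloseAndRecv(); err != nil {
	return err
}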

	}

	return nil
}

func (r *Reconciler) streamLogs(ctx context.Context, o results.Object, logType, logName string) error {
	logger := logging.FromContext(ctx)
	logsClient, err := r.resultsClient.UpdateLog(ctx)
	streamCtx, streamCancel := context.WithTimeout(ctx, 5*time.Minute)
@gabemontero (Contributor Author):

If we go with the broader use of a context timeout within the dynamic reconciler loop, we would get rid of this particular child context with timeout.

@tekton-robot tekton-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 9, 2024
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 62.4% -1.8

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 61.1% -3.1

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 60.4% -3.8

@gabemontero (Contributor Author)

OK @sayan-biswas, I now have a third commit that determines whether the context, or any specific gRPC call, exited because of an exceeded deadline, and if so dumps a goroutine list with stack traces to stdout for analysis.

So, with our hang on Friday, either the tkn client threading ended in a deadlock, or, more likely, a gRPC call to the API server hung (perhaps because the API server got into a bad state, which may have happened since it restarted on its own).

In either case, we should get a thread dump in the log that we can diagnose.
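For reference, the check that would trigger such a dump can look roughly like this sketch; the helper name is illustrative, and the stacks go to stdout via runtime/pprof:

import (
	"context"
	"errors"
	"os"
	"runtime/pprof"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// dumpOnDeadline (illustrative name): if the reconcile context expired, or a
// specific gRPC call came back with DeadlineExceeded, write every goroutine's
// stack (debug level 2) to stdout so the blocked call can be identified later.
func dumpOnDeadline(ctx context.Context, rpcErr error) {
	if errors.Is(ctx.Err(), context.DeadlineExceeded) ||
		status.Code(rpcErr) == codes.DeadlineExceeded {
		_ = pprof.Lookup("goroutine").WriteTo(os.Stdout, 2)
	}
}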

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 58.9% -5.3

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 58.9% -5.3

@gabemontero gabemontero changed the title WIP: add timeout to streamLogs; switch from CloseSend to CloseRecv in streamLogs WIP: add timeout to dynamic Reconcile; switch from CloseSend to CloseRecv in streamLogs; add goroutine dump on timeouts/deadlocks Mar 15, 2024
@gabemontero gabemontero changed the title WIP: add timeout to dynamic Reconcile; switch from CloseSend to CloseRecv in streamLogs; add goroutine dump on timeouts/deadlocks add timeout to dynamic Reconcile; switch from CloseSend to CloseRecv in streamLogs; add goroutine dump on timeouts/deadlocks Mar 15, 2024
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 15, 2024
@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 58.5% -5.7

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/api/server/v1alpha2/logs.go 68.3% 68.4% 0.1
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 58.5% -5.7

@gabemontero (Contributor Author)

e2e's are green with the timeout fully propagated and no go func for sendLogs.

I will update the configuration of the timeout now.

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
cmd/watcher/main.go Do not exist 0.0%
pkg/api/server/v1alpha2/logs.go 68.3% 68.4% 0.1
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 56.5% -7.7

@gabemontero (Contributor Author)

@sayan-biswas @khrm @avinal @enarha this PR is ready for review, iterate, merge

	// context with timeout does not work with the partial end to end flow that exists with unit tests;
	// this field will always be set for real
	if r.cfg != nil && r.cfg.UpdateLogTimeout != nil {
		dynamicContext, dynamicCancel = context.WithTimeout(ctx, *r.cfg.UpdateLogTimeout)
@khrm (Contributor):

Suggested change:
-	dynamicContext, dynamicCancel = context.WithTimeout(ctx, *r.cfg.UpdateLogTimeout)
+	ctx, dynamicCancel = context.WithTimeout(ctx, *r.cfg.UpdateLogTimeout)

@gabemontero (Contributor Author):

Yeah, I'm not a fan of using the generic variable name for this ... I was intentional with this name change.

Can you elaborate @khrm on why you want the name to be ctx?

@khrm (Contributor):

Someone modifying the function later on could easily use ctx by mistake, since ctx is still available.

@gabemontero (Contributor Author):

ok you've swayed me ... will adjust

@gabemontero (Contributor Author):

I've pushed it as a separate commit for now @khrm to facilitate your review ... I can squash once we are synced on the change being ready.

@@ -333,22 +414,21 @@ func (r *Reconciler) sendLog(ctx context.Context, o results.Object) error {
		zap.String("name", o.GetName()),
	)

	go func() {
Contributor (reviewer):

Why don't we pass the context here and use a select with ctx.Done()?

@gabemontero (Contributor Author):

Because the end result is the same: we block the reconcile thread until the operation is complete (see the sketch below).

There is no reason for the added complexity.
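For comparison, a sketch of the two shapes being discussed; since sendLog's own RPCs carry ctx and give up when its deadline fires, both keep the reconcile thread blocked for the same duration:

// (a) direct call on the reconciler thread, as done in this PR:
//
//	err := r.sendLog(ctx, o)
//
// (b) extra goroutine plus select on ctx.Done(): the reconcile thread still
// waits right here, so this only adds a channel and a goroutine without
// changing the outcome.
errCh := make(chan error, 1)
go func() { errCh <- r.sendLog(ctx, o) }()
var err error
select {
case err = <-errCh:
case <-ctx.Done():
	err = ctx.Err()
}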

@@ -396,6 +499,13 @@ func (r *Reconciler) streamLogs(ctx context.Context, o results.Object, logType,
		Err: inMemWriteBufferStderr,
	}, logChan, errChan)

	// pull the first error that occurred and return on that; reminder - per https://golang.org/ref/spec#Channel_types
	// channels act as FIFO queues
	chanErr, ok := <-errChan
@gabemontero (Contributor Author):

Since @khrm's last review, I've added here a simplified version of the error handling that @sayan-biswas used prior to PR #712.

Let's just return the first error seen; from what I see in the tkn code, multiple errors are not collected.

With this, @sayan-biswas, plus the prior changes made with #712 and moving the sendLogs call back onto the reconciler thread, we now address one of the shortcomings of the old implementation, namely ignoring errors and preventing retries on reconciliation.

Reminder: with the threadiness arg, users can adjust the number of threads when they enable log support.

I will add a separate PR for the k8s client tuning shortly.

@gabemontero (Contributor Author):

#744 is the add tuning options PR

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/api/server/v1alpha2/logs.go 68.3% 68.4% 0.1
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.7% -8.5

@gabemontero (Contributor Author)

@khrm @sayan-biswas this is ready for review again .... once we are good, I can squash as needed .. thanks

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/api/server/v1alpha2/logs.go 68.3% 68.4% 0.1
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.3% -8.9

@khrm (Contributor) left a comment:

/approve

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 4, 2024
@sayan-biswas (Contributor)

/approve

@tekton-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: khrm, sayan-biswas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sayan-biswas (Contributor) commented Apr 5, 2024

> @khrm @sayan-biswas this is ready for review again .... once we are good, I can squash as needed .. thanks

Changes are fine.
Please go ahead and squash.

…next with timeout

switch from CloseSend to CloseRecv in streamLogs
add analysis, and a possible goroutine dump, to the context with timeout in the dynamic reconciler

We ran with the change from tektoncd#712 in our production system, and while we confirmed that the results logging memory leak
was addressed, after about 12 to 13 hours our reconciler threads systematically became deadlocked and our watcher
quit processing events.

We have not yet been able to get a goroutine dump with stack traces when this problem occurs, so we are unclear
whether the tektoncd#712 fixes directly caused the deadlock or whether another issue was encountered. Among other things,
our API server container restarted during the watcher deadlock, and the previous pod logs gave no clear indication as to why.

This change pulls in a few potentially helpful pieces to either diagnose or work around the deadlock:
1) we have added a configurable timeout to the context used in the dynamic Reconcile method, in case a blocked RPC call holding that context was somehow causing the problem
2) we also cancel the context on method exit, again to unblock things, and switch to CloseAndRecv instead of CloseSend to confirm that UpdateLog finished, so that cancelling the streamLogs context does not intermittently abort an UpdateLog call that would otherwise have succeeded
3) we analyze how the context was released, and if it was released by a timeout rather than a cancel, we initiate a goroutine dump with stack traces
4) using a context with timeout that is cancelled on exit from the Reconcile method requires no longer running sendLogs on a separate goroutine; otherwise we would reintroduce intermittent cancellation of UpdateLog processing before it could complete
5) we now log the deadlines for UpdateLog on the API server side
6) we are back to pulling errors off of the tkn client error channel
@gabemontero (Contributor Author)

Thanks @sayan-biswas @khrm, the commits are squashed.

@tekton-robot

The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/api/server/v1alpha2/logs.go 68.3% 68.4% 0.1
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.3% -8.9

@sayan-biswas (Contributor)

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 8, 2024
@tekton-robot tekton-robot merged commit aa5bb71 into tektoncd:main Apr 8, 2024
6 checks passed
@gabemontero gabemontero deleted the possible-hang-diag branch April 12, 2024 14:15